Project 4: White wine quality analysis by Nasim Khadem

knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE)

First I loaded the wine data and looked at the variable names and the structure of the data. The structure is discussed further in a later section.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Next, let’s look at the summary of each feature:

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

The summary shows the mean and median of each feature, which helps in interpreting the hills and valleys in the upcoming plots.

Univariate Plots Section

I plotted a histogram of each variable to get a quick look at the data.

At first glance, most of the features show a single peak. However, sugar, sulphates, alcohol and probably pH show multiple peaks, which might indicate multimodal distributions. I will investigate this further.

Alcohol further examination

Let’s look at alcohol content. I start by looking at the summary of alcohol.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The median (10.40) and the mean (10.51) of alcohol are close to each other. Let’s try different binwidths to see if there is something interesting.

In this plot the data is skewed to the right. The binwidth is 1. Let’s make the binwidth smaller.

With the smaller binwidth, it can be seen that the data is probably bimodal, maybe even trimodal.

The highest peak is at about 9.4, which is below the median (10.4) and close to the first quartile (9.5). This is probably a bimodal distribution.

## Mode of Alcohol:  9.4

The mode is about 9.4, which describes the peak around 9.4 in the previous plot.
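Base R has no built-in function for the statistical mode, so it has to be computed by hand. The following is a minimal sketch of one way to do it; the helper name `stat_mode` and the toy values are assumptions for illustration and are not taken from the original analysis.

```r
# Toy alcohol values (illustrative only, not the wine data).
alcohol <- c(9.4, 9.5, 9.4, 10.1, 11.0, 9.4, 10.1, 12.3)

# Hypothetical helper: tabulate the values, pick the most frequent one,
# and convert its label back to a number.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

alcohol_mode <- stat_mode(alcohol)
cat("Mode of Alcohol: ", alcohol_mode, "\n")
```

If several values tie for the highest count, `which.max` returns the first of them, so this sketch reports only one mode.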

Since the data is still skewed, let’s do a log10 transformation.

The plot shows a peak around 10. However, this is not the mode of the data, which is marked in the plot at 9.4. This plot supports the idea that alcohol is bimodal. The median (10.4) is very close to the mean (10.51).

Sugar

Let’s look at the sugar summary and mode.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800
## Residual sugar mode:  1.2

Next, let’s plot the histogram of residual sugar with a binwidth of 0.1.

The plot only shows the data between 0 and 20, since the range between 20 and the maximum (65.8) contains very few wines. The data is not normally distributed and is over-dispersed, with a long right tail. To correct for this, I will do a log10 transformation:

Residual sugar has a bimodal distribution. The highest peak is at 1.2 (the mode of the data). This plot suggests that there are two groups of wines: one with lower sugar, peaking at 1.2, and one with higher sugar, with the second highest peak at about 8.2. That value happens to equal the interquartile range (Q3 - Q1 = 9.9 - 1.7 = 8.2), although the interquartile range measures the spread of the data rather than the location of a peak.
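The effect of the log10 transformation on a right-skewed variable can be sketched as follows. The data here is synthetic (a mixture of two log-normal clusters chosen only to mimic a skewed, two-group variable), and base R `hist` is used as an assumption; the original plots may have been built differently.

```r
set.seed(1)
# Synthetic, right-skewed stand-in for residual sugar (not the real data).
sugar <- c(rlnorm(300, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(200, meanlog = log(8), sdlog = 0.4))

# Histogram on the original scale, restricted to 0-20 with a 0.1 binwidth.
h_raw <- hist(sugar[sugar <= 20], breaks = seq(0, 20, by = 0.1), plot = FALSE)

# Histogram of log10(sugar): the long right tail is compressed, so
# separate clusters of low and high values are easier to see.
h_log <- hist(log10(sugar), breaks = 40, plot = FALSE)

plot(h_log, main = "log10(residual sugar)", xlab = "log10(g / dm^3)")
```

Note that the transformation only changes the scale on which the bins are laid out; the underlying values and their counts are unchanged.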

Sulphate

Sulphate summary and mode are as follows:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800
## sulphates mode:  0.5

Now let’s look at the distribution:

Sulphates is slightly skewed to the right, with a mean of about 0.49.

pH

pH summary and mode are as follows:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820
## pH mode:  3.14

pH is approximately normally distributed, with a mean of about 3.19 and a mode of 3.14.

New Variable creation

Here I create a new variable called quality group, which groups the wines into three groups: low, average and high quality. The bin (2,5] is the low quality group, the bin (5,6] is the average group and the bin (6,9] is the high quality group.

Since the quality scores are not evenly distributed, I chose the bins so that the group sizes are more or less equal. Let’s look at the histograms of quality and quality group.

This plot shows the distribution of quality and of the quality groups. Wines with quality 3, 4 and 9 are sparse, so the grouping was chosen to make the groups more evenly distributed.
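The binning described above can be sketched with base R's `cut`, which uses right-closed intervals by default and therefore matches the (2,5], (5,6], (6,9] notation. The toy quality scores below are made up for illustration, and using `cut` is an assumption about how the variable was built.

```r
# Toy quality scores covering the observed range 3-9 (illustrative only).
quality <- c(3, 4, 5, 5, 6, 6, 6, 7, 8, 9)

# Right-closed intervals: (2,5] -> low, (5,6] -> average, (6,9] -> high.
quality_group <- cut(quality,
                     breaks = c(2, 5, 6, 9),
                     labels = c("low", "average", "high"))

group_counts <- table(quality_group)
print(group_counts)
```

Because the result is a factor with ordered levels, it can be used directly for faceting or coloring plots by group.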

Univariate Analysis

What is the structure of your dataset?

The data set has 4898 observations of 13 variables. The variable X is just an index of the tested wine. There are 12 other variables. Quality is an ordinal variable that scores the wine from 0 to 10; the scores in this data range from 3 to 9.

What is/are the main feature(s) of interest in your dataset?

Probably the main feature in the wine data set is its quality, since the other features can be used to train a model that estimates the quality of the wine. This can be seen as a classification problem (one class per quality score; seven distinct scores, 3 to 9, occur in this data). It can also be treated as a regression problem in which we try to predict the quality score. It is interesting to see whether any of the features correlates with quality, although correlation does not imply causation.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Other features that were interesting to me were sugar, alcohol, pH and sulphates. I saw irregularities in these features, so I plotted them in more detail. Alcohol content was of particular interest: higher quality wines tend to have a higher alcohol content. The wines also seem to be divided into two groups, one with a lower and one with a higher sugar content.

Did you create any new variables from existing variables in the dataset?

Yes, I made a new variable, quality_group, that groups the quality scores into three buckets: low, average and high quality wine. The individual quality scores were not evenly distributed; grouped this way, the data is more evenly distributed.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The wine data set was quite tidy. Alcohol had a bimodal distribution. For plotting, a couple of log10 transformations were applied in order to see the patterns better.

Bivariate Plots Section

In the bivariate section I would like to look at the correlations between features, since this gives a guideline for which features to plot and which features are actually related to the quality of the wine. First, let’s check whether there are any missing values in the wine data set.

## missing values : 0

Since there are no missing values, the function cor was used to create the correlation matrix.
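The two steps, counting missing values and building the correlation matrix, can be sketched as follows. The data frame `wd_toy` and its values are made up for illustration; only `is.na`, `cor` and `round` are standard R functions.

```r
# Small synthetic data frame standing in for the wine data (values are made up).
wd_toy <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 11.0, 12.4),
  density = c(1.0010, 0.9940, 0.9951, 0.9920, 0.9900),
  quality = c(5, 6, 6, 7, 8)
)

# Count missing values across the whole data frame.
n_missing <- sum(is.na(wd_toy))
cat("missing values :", n_missing, "\n")

# Pearson correlation matrix, rounded to two decimals.
cor_matrix <- round(cor(wd_toy), 2)
print(cor_matrix)
```

If a data set did contain missing values, `cor` would return `NA` for the affected pairs unless its `use` argument (for example `use = "complete.obs"`) is set.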

Correlation Matrix

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                 1.00            -0.02        0.29
## volatile.acidity             -0.02             1.00       -0.15
## citric.acid                   0.29            -0.15        1.00
## residual.sugar                0.09             0.06        0.09
## chlorides                     0.02             0.07        0.11
## free.sulfur.dioxide          -0.05            -0.10        0.09
## total.sulfur.dioxide          0.09             0.09        0.12
## density                       0.27             0.03        0.15
## pH                           -0.43            -0.03       -0.16
## sulphates                    -0.02            -0.04        0.06
## alcohol                      -0.12             0.07       -0.08
## quality                      -0.11            -0.19       -0.01
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                  0.09      0.02               -0.05
## volatile.acidity               0.06      0.07               -0.10
## citric.acid                    0.09      0.11                0.09
## residual.sugar                 1.00      0.09                0.30
## chlorides                      0.09      1.00                0.10
## free.sulfur.dioxide            0.30      0.10                1.00
## total.sulfur.dioxide           0.40      0.20                0.62
## density                        0.84      0.26                0.29
## pH                            -0.19     -0.09                0.00
## sulphates                     -0.03      0.02                0.06
## alcohol                       -0.45     -0.36               -0.25
## quality                       -0.10     -0.21                0.01
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                        0.09    0.27 -0.43     -0.02   -0.12
## volatile.acidity                     0.09    0.03 -0.03     -0.04    0.07
## citric.acid                          0.12    0.15 -0.16      0.06   -0.08
## residual.sugar                       0.40    0.84 -0.19     -0.03   -0.45
## chlorides                            0.20    0.26 -0.09      0.02   -0.36
## free.sulfur.dioxide                  0.62    0.29  0.00      0.06   -0.25
## total.sulfur.dioxide                 1.00    0.53  0.00      0.13   -0.45
## density                              0.53    1.00 -0.09      0.07   -0.78
## pH                                   0.00   -0.09  1.00      0.16    0.12
## sulphates                            0.13    0.07  0.16      1.00   -0.02
## alcohol                             -0.45   -0.78  0.12     -0.02    1.00
## quality                             -0.17   -0.31  0.10      0.05    0.44
##                      quality
## fixed.acidity          -0.11
## volatile.acidity       -0.19
## citric.acid            -0.01
## residual.sugar         -0.10
## chlorides              -0.21
## free.sulfur.dioxide     0.01
## total.sulfur.dioxide   -0.17
## density                -0.31
## pH                      0.10
## sulphates               0.05
## alcohol                 0.44
## quality                 1.00

This is the correlation matrix of the wine data. It is difficult to interpret in this form, so in the next section I visualize it.

Visualizing Correlation

It is easier to see the correlations in the plot above. For example, alcohol has a strong negative correlation with density, around -0.78. Also, with a value of 0.84, density is highly correlated with residual sugar. Alcohol seems to be related to density, total sulfur dioxide and sugar. pH is negatively correlated with fixed acidity, which makes sense: the higher the acidity, the lower the pH. Another correlation is between free and total sulfur dioxide, which is again expected since free sulfur dioxide is a part of total sulfur dioxide.

Alcohol vs. density

Alcohol content and density have a strong negative correlation, with a value of -0.78. The Pearson correlation measures the linear relationship between two variables. As seen in the plot, a line was fitted to the data points to show the linear relationship between alcohol and density. However, there are a couple of outliers that make it hard to appreciate the distribution. I will limit the x and y axes to handle the outliers in the plot.

Here only the values between the 1st and 95th percentiles are shown, which makes the strong negative correlation (-0.78) between alcohol content and density easier to see.
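Restricting the axes to a percentile range can be sketched with `quantile`. The data below is synthetic and base R graphics are used as an assumption; the original plot may have been drawn with a different plotting library.

```r
set.seed(42)
# Synthetic stand-ins for alcohol and density (not the real measurements).
alcohol <- runif(200, min = 8, max = 14)
density <- 1.005 - 0.001 * alcohol + rnorm(200, sd = 0.001)

# Axis limits at the 1st and 95th percentiles of each variable.
x_limits <- quantile(alcohol, probs = c(0.01, 0.95))
y_limits <- quantile(density, probs = c(0.01, 0.95))

# Scatter plot restricted to those limits, with a fitted straight line.
plot(alcohol, density, xlim = x_limits, ylim = y_limits,
     col = rgb(0, 0, 1, alpha = 0.3), pch = 16)
abline(lm(density ~ alcohol), col = "red")
```

In this sketch the line is fitted to all points and only the view is restricted. In ggplot2 the choice matters: setting limits with `xlim()` drops the out-of-range points before the smoother is fitted, whereas `coord_cartesian()` only zooms.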

Residual sugar vs. density

Let’s plot residual sugar and density together.

The plot also shows a linear pattern. However, the plot can be improved, since the outliers are not of interest to us. Let’s show only the values up to the 95th percentile.

Much better. Here again the linear correlation between residual sugar and density is shown. It seems that density is highly correlated with alcohol and sugar. This makes sense, as the density of a wine depends largely on how much alcohol and sugar it contains. Therefore, if the data were used to train a model later, density could be replaced by alcohol and sugar, or alcohol and sugar could be replaced by density. We also see a lot of points around a sugar value of 1.2, which was the mode of residual sugar.

This is one of the relationships that I was not expecting. Alcohol content is negatively correlated with the salt (chlorides) content of the wine, with a value of -0.36.

## 
##  Pearson's product-moment correlation
## 
## data:  wd$alcohol and wd$chlorides
## t = -27.016, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3843183 -0.3355673
## sample estimates:
##        cor 
## -0.3601887
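The output above has the form printed by R's `cor.test`. A minimal sketch of such a call on synthetic data follows; the vectors are made up and only illustrate the shape of the result, not the reported numbers.

```r
set.seed(7)
# Synthetic stand-ins for alcohol and chlorides (not the real measurements).
alcohol <- rnorm(100, mean = 10.5, sd = 1.2)
chlorides <- 0.08 - 0.003 * alcohol + rnorm(100, sd = 0.01)

# Pearson correlation test: estimate, t statistic, degrees of freedom,
# p-value and a 95 percent confidence interval.
result <- cor.test(alcohol, chlorides, method = "pearson")
print(result)
```

The degrees of freedom are the number of observations minus two, which is consistent with df = 4896 for the 4898 wines in the output above.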

Alcohol for each quality group

Let’s look at the alcohol content for each quality group:

Very few wines have quality 3 (worst) or quality 9 (best). The trend is that the higher the percentage of alcohol, the better the quality. For qualities 4 and 5 the data is concentrated at lower alcohol values (skewed to the right), which means there is less alcohol in lower quality wines. Quality 6, which is average, is still a bit skewed to the right. For qualities 7, 8 and 9 the data is skewed to the left, meaning that higher quality wines tend to have a higher alcohol content.

Is this why we have a bimodal distribution in the alcohol feature: lower alcohol content for lower quality wines and higher alcohol content for higher quality wines?

Let’s analyze quality groups versus alcohol.

It can still be seen here that the higher quality wines are skewed toward the left, which means they have a higher alcohol content.

Let’s do the same analysis for pH and sugar.

pH for each quality group

The distribution for each quality group is similar. Groups 9 and 3 have very few observations, which is why their distributions look different. More or less all the wine groups have a pH between 2.7 and 3.8, which makes them acidic.

Sugar for each quality group

The plot of sugar for each quality shows that the bimodal character of the data is preserved within each quality group: in each group there are wines with more sugar and wines with less sugar. I think that since quality groups 3 and 9 are very sparse, the trend is not seen in these two groups.

Alcohol boxplot for different qualities

As can be seen in the plot above, high quality wines have a higher alcohol content. Within the low quality wines the alcohol content is decreasing. However, the mean alcohol content of the low quality group is still smaller than that of the average and high quality groups. The mean values are marked by an X in the plot. Let’s plot the alcohol content for the low, average and high quality groups only:

This plot clearly shows that quality drops as the alcohol content decreases. Higher quality wines have a higher alcohol content.

New data frame from groupings of wines

I first created a new data set that groups the wines by quality and then calculated the mean and median alcohol content of each group.

I will use this new data frame to show how the amount of chlorides (sodium chloride) differs between the groups of wines.

This plot also shows that higher quality wines have lower chlorides. It is a nice plot since it shows the separation of the groups very well.
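Building such a grouped data frame can be sketched with base R's `aggregate`. The data frame `wd_toy` and its values are made up, and using `aggregate` is an assumption; the original analysis may have used a different grouping tool.

```r
# Synthetic stand-in for the wine data (values are made up).
wd_toy <- data.frame(
  quality   = c(4, 5, 5, 6, 6, 7, 8),
  alcohol   = c(9.0, 9.4, 9.6, 10.4, 10.6, 11.5, 12.0),
  chlorides = c(0.060, 0.050, 0.054, 0.045, 0.043, 0.038, 0.036)
)
wd_toy$quality_group <- cut(wd_toy$quality, breaks = c(2, 5, 6, 9),
                            labels = c("low", "average", "high"))

# Mean alcohol and chlorides per quality group.
group_means <- aggregate(cbind(alcohol, chlorides) ~ quality_group,
                         data = wd_toy, FUN = mean)

# Median alcohol per quality group.
group_medians <- aggregate(alcohol ~ quality_group, data = wd_toy, FUN = median)

print(group_means)
```

The result has one row per group, which makes it convenient for plotting group-level summaries on top of the raw points.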

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality has its highest correlation with alcohol content (0.44) and a negative correlation with density (-0.31). Alcohol also has a negative correlation with density (-0.78), so a correlation between quality and density is to be expected.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The alcohol content had a negative correlation with chlorides, which was interesting to see, since I did not expect the amount of salt in the wine to be related to the alcohol level. The other relationships were to be expected; for example, a higher pH goes with a lower fixed acidity.

What was the strongest relationship you found?

The strongest relationship is between density and residual sugar, with a correlation of 0.84. Alcohol also has a strong negative correlation with density, around -0.78.

Multivariate Plots Section

In this section I will use the color as a third variable in the plots and show different trends in data.

Alcohol(%) vs. Residual Sugar

As before, alcohol content is the feature that drives these two plots. Since it is hard to see a clear distinction between the 7 quality scores, the wines were divided into three groups of low, average and high quality (seen in the plot on the right). All quality groups span a wide range of sugar content, which suggests that sugar does not have a strong influence on quality. However, once again it can be seen that quality is strongly related to alcohol content.

Alcohol(%) vs. sodium chloride

In these plots we see that quality has a negative correlation with chloride and a positive correlation with alcohol, meaning high quality wines tend to have higher alcohol content and lower chloride.

This plot shows that the grand mean is closest to group 6, the average wine group. The reason is that group 6 has the highest count and therefore influences the grand mean the most. These two plots also support the finding that higher quality wines tend to have more alcohol and lower chlorides.
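The pull of the largest group on the grand mean can be illustrated with a small worked example: the grand mean is the average of the group means weighted by the group counts. The means and counts below are made up and are not the values from the wine data.

```r
# Made-up group means and counts for three quality scores (illustrative only).
group_mean  <- c(q5 = 9.8, q6 = 10.6, q7 = 11.4)
group_count <- c(q5 = 300, q6 = 600, q7 = 100)

# Grand mean = count-weighted average of the group means.
grand_mean <- sum(group_mean * group_count) / sum(group_count)

# Plain average of the group means, ignoring the group sizes.
unweighted_mean <- mean(group_mean)

# Group whose mean lies closest to the grand mean.
closest_group <- names(which.min(abs(group_mean - grand_mean)))

cat("grand mean:", grand_mean, " closest group:", closest_group, "\n")
```

With these numbers the grand mean is (9.8 * 300 + 10.6 * 600 + 11.4 * 100) / 1000 = 10.44, which sits nearest to the mean of the largest group.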

Sodium chloride vs. Density

After alcohol, the features most strongly (and negatively) correlated with quality are density (-0.31) and chlorides (-0.21). I have plotted these two features against each other and, as can be seen again, the lower the density and the chlorides, the higher the quality.

Total sulfur dioxide vs. acetic acid

The features total.sulfur.dioxide and acetic acid do not show any obvious division between the quality groups. This was to be expected, since neither is strongly correlated with quality (-0.17 and -0.19).

Alcohol vs. acetic acid

Alcohol (%) is plotted against acetic acid (g / dm^3), i.e. volatile acidity. The distinction between the high and average groups is difficult to see. However, low quality wines tend to have higher acetic acid and a lower alcohol percentage.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Wine quality has a negative correlation with chlorides, density, acetic acid (volatile.acidity) and total.sulfur.dioxide, and a stronger positive correlation with alcohol.

Were there any interesting or surprising interactions between features?

I thought the relationship between chloride and alcohol content was interesting. They are negatively correlated.


Final Plots and Summary

The three plots that summarize the findings in the best way will be shown:

Plot One

Description One

I created a new data set with the mean of each group, after grouping the quality scores into three groups: low, average and high. Here we see a nice distinction between the quality groups in terms of chlorides. As seen in the plot, the higher the chloride (salt) content, the lower the quality.

Plot Two

Description Two

This plot shows that the higher the alcohol content, the better the wine quality tends to be. I think this might be because the longer the wine is fermented, the higher the alcohol content and the higher the quality. However, this is just a hypothesis.

Plot Three

Description Three

Plotting was hard since there were many data points, and even with different alpha values the plot was still not very clear. In the plot above, however, we can see that wines with higher alcohol content and lower chlorides tend to be in the higher quality groups (the blue points), and wines with lower alcohol and higher chlorides tend to be in the lower quality group (the red points). The average wines are roughly in the middle (the green points).

Reflection

In this data set there were 11 features that could influence the quality of the wine. I have done univariate, bivariate and multivariate analysis on the data. I first read the text file that came with the data to see what the authors had in mind. This helped me later in understanding the relationships between some of the features. There were very few wines with quality 3, 4 and 9, which made it difficult to always see a clear trend. I therefore tried a different grouping with more or less similar group sizes: low (2,5], average (5,6] and high (6,9].

My findings :

  1. Alcohol is the feature most strongly related to wine quality. The higher the alcohol content, the better the wine tends to be.
  2. Wines with more salt (chlorides) tend to fall in a lower quality group.
  3. The sugar left in the wine after fermentation varies widely and shows little relationship with quality.
  4. Density depends on how much alcohol and how much sugar is left in the wine. It is strongly correlated with both sugar and alcohol.
  5. If volatile acidity (acetic acid) gets high, the wine gets a vinegar-like taste. However, in the correlation plot there is only a weak negative correlation (-0.19) between volatile acidity and the quality of the wine.
  6. Salt and alcohol have a negative correlation, meaning the higher the chlorides, the lower the alcohol content.

The hardest part of this project was the univariate analysis and trying to make sense of the data with transformations. I tried log10 and sqrt transformations, but it does not come naturally which one to use when. Also, since this was my first time using R, it was a bit hard to be fast. However, plotting in R is very easy and I really like how one can add layers on top of each other, which makes visualizing a very easy task. Also, since I do not drink wine, it was hard to imagine how a wine can have salt in it.

Future work

For future work I would like to play around with the three kinds of acids a bit more, since I think these three features have a lot more to offer. I would also like to create a model that can distinguish white wine from red wine. On top of that, a model to predict the quality of wine would be interesting as well.

Reference